
Finding outliers

Outlier detection is the identification of data points that differ significantly from other values in the data set. For example, outliers could be errors or unusual entities in the data set. Outlier detection is an unsupervised machine learning technique: there is no need to provide training data.

Note: Outlier detection is a batch analysis; it runs against your data once. If new data comes into the index, you need to run the analysis again on the altered data.

Outlier detection algorithms

In the Elastic Stack, we use an ensemble of four different distance- and density-based outlier detection methods:

distance of the Kth nearest neighbor: computes the distance of the data point to its Kth nearest neighbor, where K is a small number and usually independent of the total number of data points.

distance of K-nearest neighbors: calculates the average distance of the data points to their nearest neighbors. Points with the largest average distance are the most outlying.

local outlier factor (lof): takes into account the distance of the points to their K nearest neighbors and also the distance of these neighbors to their neighbors.

local distance-based outlier factor (ldof): is a ratio of two measures: the first computes the average distance of the data point to its K nearest neighbors; the second computes the average of the pairwise distances of the neighbors themselves.

You don't need to select the methods or provide any parameters, but you can override the default behavior if you like. Distance-based methods assume that normal data points remain close or similar in value while outliers are located far away or differ significantly in value. The drawback of these methods is that they don't take into account the density variations of a data set; density-based methods mitigate this problem.
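To make these definitions concrete, the following is a minimal NumPy/scikit-learn sketch that computes the four scores on a toy 2-D data set. It illustrates the ideas only and is not the Elastic implementation; the toy data, the choice of K = 5, and the use of scikit-learn for lof are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # dense "normal" cluster
               np.array([[8.0, 8.0]])])           # one obvious outlier

def knn_distances(X, k):
    """Sorted distances from each point to its k nearest neighbors (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # ignore the zero self-distance
    idx = np.argsort(d, axis=1)[:, :k]             # indices of the k nearest neighbors
    return np.take_along_axis(d, idx, axis=1), idx

k = 5
nn_dist, nn_idx = knn_distances(X, k)

# distance of the Kth nearest neighbor
kth_nn = nn_dist[:, -1]

# distance of K-nearest neighbors (average distance)
avg_knn = nn_dist.mean(axis=1)

# local outlier factor (lof): reuse scikit-learn's implementation here
lof = -LocalOutlierFactor(n_neighbors=k).fit(X).negative_outlier_factor_

# local distance-based outlier factor (ldof): mean distance to the k neighbors
# divided by the mean pairwise distance among those neighbors
ldof = np.empty(len(X))
for i in range(len(X)):
    nbrs = X[nn_idx[i]]
    pairwise = np.linalg.norm(nbrs[:, None, :] - nbrs[None, :, :], axis=-1)
    ldof[i] = avg_knn[i] / pairwise[np.triu_indices(k, 1)].mean()

# the injected point (index 100) should score highest under every method
print(kth_nn.argmax(), avg_knn.argmax(), lof.argmax(), ldof.argmax())
```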

The four algorithms don't always agree on which points are outliers. By default, outlier detection jobs use all of these methods, then normalize and combine their results and give every data point in the index an outlier score. The outlier score ranges from 0 to 1, where a higher number indicates a greater chance that the data point is an outlier compared to the other data points in the index.
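How the individual scores are normalized and weighted is internal to the Elastic Stack, but the sketch below shows one simple way an ensemble score in the 0 to 1 range can be produced: rescale each method's raw scores and average them. The min-max rescaling and the equal weighting are assumptions for illustration only.

```python
import numpy as np

def min_max(scores):
    """Rescale a vector of raw scores to the [0, 1] range."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)   # avoid division by zero

def ensemble_outlier_score(*method_scores):
    """Average the normalized scores of the individual methods."""
    return np.mean([min_max(s) for s in method_scores], axis=0)

# continuing from the previous sketch:
# outlier_score = ensemble_outlier_score(kth_nn, avg_knn, lof, ldof)
```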

Feature influence

Feature influence – another score calculated while detecting outliers – provides a relative ranking of the different features and their contribution towards a point being an outlier. This score helps you understand the context and the reasoning behind why a certain data point is an outlier.
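Elastic's feature influence calculation is not reproduced here, but the hedged sketch below shows one simple way to approximate a relative per-feature ranking: re-score the point with each feature neutralized (replaced by the data set's median for that feature) and compare. The `score_fn` argument, the median substitution, and the normalization are all assumptions for illustration.

```python
import numpy as np

def feature_influence(score_fn, X, i):
    """Relative influence of each feature on point i's outlier score.

    score_fn(X) must return one outlier score per row of X.
    """
    base = score_fn(X)[i]
    drops = []
    for j in range(X.shape[1]):
        X_mod = X.copy()
        X_mod[i, j] = np.median(X[:, j])          # neutralize feature j for point i
        drops.append(max(base - score_fn(X_mod)[i], 0.0))
    drops = np.asarray(drops)
    return drops / drops.sum() if drops.sum() > 0 else drops   # relative ranking
```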

1. Define the problem

Outlier detection in the Elastic Stack can be used to detect any unusual entity in a given population, for example malicious software on a machine or unusual user behavior on a network. Outlier detection operates on the assumption that the outliers make up a small proportion of the overall data population, so it is a good fit for such use cases. It is a batch analysis that works best on an entity-centric index. If your use case is based on time series data, you might want to use anomaly detection instead.

The machine learning features provide unsupervised outlier detection, which means there is no need to provide a training data set.
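As a concrete starting point, the sketch below configures and starts an outlier detection job with the Python Elasticsearch client. The connection details, index names, and job id are placeholders, and the exact call signature can differ between client versions; treat this as an outline under those assumptions rather than a definitive recipe.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")       # placeholder connection details

# Create an outlier detection data frame analytics job; the empty
# outlier_detection object accepts the default methods and parameters.
es.ml.put_data_frame_analytics(
    id="my-outlier-job",                          # hypothetical job id
    source={"index": "my-entity-centric-index"},  # placeholder source index
    dest={"index": "my-outlier-results"},         # placeholder results index
    analysis={"outlier_detection": {}},           # unsupervised: no training data
)

es.ml.start_data_frame_analytics(id="my-outlier-job")
```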